Big Data Provenance: Challenges and Implications for Benchmarking
Abstract
Data provenance is information about the origin and creation process of data. Such information is useful for debugging data and transformations, auditing, evaluating the quality of and trust in data, modelling authenticity, and implementing access control for derived data. Provenance has been studied by the database, workflow, and distributed systems communities, but provenance for Big Data, which we refer to as Big Provenance, is a largely unexplored field. This paper reviews existing approaches for large-scale distributed provenance and discusses potential challenges for Big Data benchmarks that aim to incorporate provenance data and provenance management. Furthermore, we examine how Big Data benchmarking could benefit from different types of provenance information. We argue that provenance can be used to identify and analyze performance bottlenecks, to compute performance metrics, and to test a system's ability to exploit commonalities in data and processing.
Similar resources
Big Data Provenance: State-Of-The-Art Analysis and Emerging Research Challenges
This paper focuses on big data provenance issues and provides a comprehensive survey of state-of-the-art analysis and emerging research challenges in this scientific field. Big data provenance is one of the most relevant problems in big data research, as confirmed by the great deal of attention devoted to this topic by larger and larger database and data mining research co...
Benchmarking Big Data Systems: State-of-the-Art and Future Directions
The rapid growth of big data systems such as Hadoop in recent years has made benchmarking these systems crucial for both the research and industry communities. The complexity, diversity, and rapid evolution of big data systems give rise to various new challenges about how to design generators that produce data with the 4V properties (i.e., volume, velocity, variety, and veracity), as we...
Crossing Analytics Systems: A Case for Integrated Provenance in Data Lakes [Preprint, eScience 2016]
The volumes of data in Big Data, their variety and unstructured nature, have had researchers looking beyond the data warehouse. The data warehouse, among other features, requires mapping data to a schema upon ingest, an approach seen as inflexible for the massive variety of Big Data. The Data Lake is emerging as an alternate solution for storing data of widely divergent types and scales. Design...
Provenance as Essential Infrastructure for Data Lakes [Preprint, forthcoming in IPAW 2016]
The Data Lake is emerging as a Big Data storage and management solution which can store any type of data at scale and execute data transformations for analysis. Higher flexibility in storage increases the risk of Data Lakes becoming data swamps. In this paper we show how provenance contributes to data management within a Data Lake infrastructure. We study provenance integration challenges and p...
On Big Data Benchmarking
Big data systems address the challenges of capturing, storing, managing, analyzing, and visualizing big data. Within this context, developing benchmarks to evaluate and compare big data systems has become an active topic for both research and industry communities. To date, most state-of-the-art big data benchmarks are designed for specific types of systems. Based on our experience, however, ...